gh-142183: Change data stack to use a resizable array #148681
dpdani wants to merge 5 commits into python:main from
Conversation
@pablogsal can you take a look this week? 🙏
I can try but I have some other PRs first in my review queue :(
Thanks for working on this. I am worried about a couple of consequences that I think we should account for before continuing with this:
A minor concern is that this seems to change Python stack memory from “roughly current depth” usage to “high-water mark for the lifetime of the thread”. In the old chunked implementation, deep recursion allocated additional 16 KiB chunks and then released most of them while unwinding. In this version, resize_stack() keeps previous stack chunks linked from stack_chunk_list, and _PyThreadState_PopFrame() only moves stack_top; the chunks are not freed until the thread state is deleted.
This doesn't seem to be a lot so I am not too worried.
My bigger concern is the profiler/debugger consequences. The old stack chunk layout allowed _remote_debugging/external unwinders to bulk-copy stack chunks cheaply. With this change, the active frame chain can span older chunks while only the newest chunk is copied in the new _remote_debugging path, so older frames fall back to individual remote reads. For a 1000-frame stack I measured:
- old no-cache unwinding: 4 memory reads, ~1.2 KiB read
- new no-cache unwinding: 966 memory reads, ~85.8 KiB read
Tachyon is probably mostly insulated because it uses frame caching, but first samples, cache-disabled paths, fallback paths, and external tools still care. Other profilers, like austin, currently hard-code the old _PyStackChunk layout (previous, size, top, data), while this patch changes it to (size, previous, data), so those tools need explicit updates.
Given this and the potential gains, I do not find the tradeoff very convincing.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase
    chunk_addr = GET_MEMBER(uintptr_t, chunks[count].local_copy, offsetof(_PyStackChunk, previous));
    count++;
    // Process this chunk
    if (process_single_stack_chunk(unwinder, chunk_addr, &chunks[count]) < 0) {
Unless I’m missing something, stopping after a single chunk here looks like a large perf regression for the profiler. The runtime still has a linked chunk chain via stack_chunk_list -> previous, but this now only copies the newest chunk. If the active frame chain spans older chunks, find_frame_in_chunks() misses those frames and we fall back to parse_frame_object(), which does one remote memory read per frame.
Unfortunately the more I think about it the less I like it: this model is harder to reason about than the previous linked-list model, and I think that matters because it creates a very easy footgun: the current chunk looks like the current stack backing store, but it is not. Existing live frames may still be in older chunks, while newer frames are in the newest chunk, so any code that treats the current chunk as the whole backing store will be subtly wrong. The risk is not just external profilers. This makes future runtime/debugger code more fragile because pointer validity and frame ownership now depend on searching the whole chunk chain, not checking the current chunk.
@pablogsal
There seems to be a misunderstanding here. No two frames will have the same offset.
The stack is a linked list of frames, not chunks. Some of which (generator and coroutine frames) aren't in chunks at all, so tools already need to handle pointers outside of the current chunk. Overall, I don't see how this really changes anything for an out-of-process profiler: Copy all the chunks, then traverse the stack. Also, note that the lower, unused part of the current chunk in a stack with multiple chunks will be untouched, so a profiler should be able to detect when it needs to cross to another chunk.
That is a trade-off that tools and libraries make: either they use stable APIs/ABIs, or they probe CPython internals. If they do the latter, they will need updating every release. @P403n1x87 would this be a problem for you?
The issue I was pointing at is the current single-chunk copy: find_frame_in_chunks() can only find frames in the newest chunk. The fix here is to restore eager copying of the full chunk chain.
Also, you’re right that “same logical offset” was poor wording. I meant that the new chunk starts allocating at the old stack depth, leaving the lower part of the new chunk unused; not that two frames have the same offset.
Here is a repro: In main we can see: with this PR: Notice
Yes, that's my bad. I made the smallest changes possible to the remote debugging module to get the PR working, but didn't inspect further improvements. Maybe it can be done in a follow-up PR by people more knowledgeable on the module? Or would you consider that a blocker?
This is a blocker. This PR adds a regression and that's not acceptable. |
(force-pushed from 8115b28 to 239e2eb)
@dpdani I pushed a fix for the concrete performance regression. I also added a regression test to make sure deep stacks are resolved from copied chunks rather than falling back to parsing frames individually from remote memory.
That said, I still think this solution is very confusing and too complex. It is harder to reason about where frames live, which chunks are relevant, and what invariants the implementation can rely on. I am worried this makes future changes easier to get subtly wrong. |
(force-pushed from 239e2eb to 99e9e44)
I investigated an alternative implementation in #149097: instead of replacing the stack with the resizable-array model, it keeps the current chunked-stack invariants and extends the existing one-chunk cache into a small bounded per-thread cache. I think this is a better direction because it fixes the allocator-thrashing issue without changing where live frames can reside and without changing the chunk layout that external tools depend on. The cache is intentionally bounded to a small number of chunks per thread. I checked the original repro and it no longer shows per-branch mmap/munmap churn.
This PR changes the implementation of the Python stack to use a resizable array. This avoids the problem of calls that frequently cause the datastack_top (now called stack_top) pointer to switch between allocations. After resizing, previous array allocations are not immediately freed, because that would cause issues for various bits around the VM that still point into them; they are instead freed along with the tstate.
During resizing, the previous contents of the stack are not copied into the new allocation; existing frames keep using the memory of the previous allocation. As frames are subsequently popped and pushed, new frames always reside in the new stack allocation.
Overall it results in a ±1% performance change (within the noise range), but it avoids degenerate cases for any number of frames. I am also told it would allow further optimizations in the JIT.